---
title: "Plotting - ggplot2"
output: 
  html_notebook: 
    code_folding: none
    css: custom.css
    theme: flatly
    toc: yes
    toc_float: yes
    toc_depth: 3
---


# Intro


<div class = "box break">
<span class = 'big-label'>~2 Minute Setup</span>

Create your R Notebook for today and double check that your workspace is clear from last time.

Load the important packages for today:

```{r eval = F}
#' You only need to run this once
install.packages("scales")
install.packages("ggthemes")
```



```{r}
library("ggplot2")
library("languageVariationAndChangeData")
library("tidyverse")
library("scales")
library("ggthemes")
```

</div>

**Note** You can find a reference for all the bits and pieces in ggplot2 here: [http://ggplot2.tidyverse.org/reference/](http://ggplot2.tidyverse.org/reference/)

Also, it's easy to make mistakes, but don't get too frustrated. Celebrate them! [https://twitter.com/accidental__art](https://twitter.com/accidental__art)



## Why Plot

We are having a lesson on plotting in the middle of a course on R modelling because it is *essential* for you to plot your data before you try to model it. I would go so far as to say that if you haven't made a *lot* of graphs of your data, and have only looked at averages, correlations, and linear model results, you don't really understand your data.

There's a classic illustration of this called Anscombe's quartet, which when plotted looks like four very distinctive patterns.

```{r echo = F, dev = 'svg'}
tidy_anscombe <- anscombe %>%
                    mutate(idx = 1:n()) %>%
                    gather(key, value, x1:y4)%>%
                    separate(key, into = c("variable", "series"), sep = 1)%>%
                    spread(variable, value)

tidy_anscombe %>%
  ggplot(aes(x, y)) + 
    geom_point() + 
    facet_wrap(~series)
```

But if you fit linear models to them, they have nearly identical statistical properties.

```{r}
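# tidy() and glance() come from broom, which library("tidyverse") does not attach
library("broom")

# Fit a simple linear regression to one series
fit_lm <- function(df){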
  lm(y ~ x, data = df)
}
anscomb_models <- tidy_anscombe %>%
                    group_by(series)%>%
                    nest()%>%
                    mutate(model = map(data, fit_lm),
                           model_param_df = map(model, tidy),
                           model_glance = map(model, glance))

anscomb_models %>%
  unnest(model_param_df)%>%
  arrange(term)
```
```{r}
anscomb_models %>%
  unnest(model_glance)
```
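
If you'd like a version of this check that doesn't depend on any model-tidying packages, here is a minimal sketch in base R. It works directly on R's built-in `anscombe` data frame (columns `x1` to `x4` and `y1` to `y4`); the name `anscombe_summary` is just something made up for this illustration.

```r
# Base-R sketch: summary statistics for each of the four Anscombe series.
# `anscombe` ships with R; `anscombe_summary` is a hypothetical name.
anscombe_summary <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(series = i,
             mean_x = mean(x),
             mean_y = mean(y),
             sd_x   = sd(x),
             sd_y   = sd(y),
             cor_xy = cor(x, y))
}))

# Rounded to two decimals, the four rows are nearly indistinguishable
round(anscombe_summary, 2)
```

The point is the same as with the fitted models: the rows of this table barely differ, even though the four scatterplots look nothing alike.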

It has been even more humorously illustrated recently that you can produce data sets of almost any arbitrary shape that have nearly identical statistical properties.

![](figures/DinoSequentialSmaller.gif)
[Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing](https://www.autodeskresearch.com/publications/samestats)

## Thinking about Plotting

It's important to think of your figures as a *report* of your data. Try to take as much care in producing your plots as you do your writing, or reporting of your statistics. They are as important as (or for some readers, more important than) anything else in your paper.


## "Accuracy"

When making a plot, you should strive to:

- Accurately represent the properties of numbers.
- Accurately represent the nature of your data.

Take this very simple data set:

<div style = "width:50%">

| group | value |
| ----: | ----: |
| A | 2 |
| B | 5 |

</div>

For our purposes, these numbers have three properties.

1. **Order**: 2 < 5, or A < B
2. **Magnitude**: 5 = 2.5 $\times$ 2, or B = 2.5 $\times$ A
3. **Contextual Magnitude**: If A and B are bars, and these are measures of the cost of a pint, then A must be a real dive (and a good deal), and B must be a little bit better, but still not too fancy. If A and B are people, and these are their number of legs, then A has an unsurprising number of legs and B has a surprising number of legs.

Here is an example of an inaccurate plot:

```{r fig.width = 5/2, fig.height = 5/2, echo = F}
num <- data.frame(group = c("A", "B"),
                  value = c(2, 5))

ggplot(num, aes(group, value))+
    geom_segment(aes(xend = group, y=1.5, yend=value), size=10)+
    theme_minimal()
```

It successfully captures the *order* of A and B, but fails to capture the correct *magnitude* of the difference. The magnitude is thrown off because the y-axis doesn't start at 0. In this plot, the B line is `r (5-1.5)/(2-1.5)`$\times$ as long as the A line, but B is actually only 2.5$\times$ A. This produces a "lie factor" of $\frac{7}{2.5} = `r 7/2.5`$.
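
The arithmetic behind that lie factor can be sketched as a small function. `lie_factor()` is a hypothetical helper written for this illustration, not part of any package. It assumes two positive values drawn as lines that start at a common `baseline`, and it divides the ratio of the drawn lengths by the ratio of the actual values.

```r
# Hypothetical helper: how much a truncated axis exaggerates a ratio.
# `values`   : the two numbers being compared (assumed positive)
# `baseline` : where the drawn lines start on the y-axis
lie_factor <- function(values, baseline) {
  drawn_ratio  <- (max(values) - baseline) / (min(values) - baseline)
  actual_ratio <- max(values) / min(values)
  drawn_ratio / actual_ratio
}

truncated <- lie_factor(c(2, 5), baseline = 1.5)  # the plot above
honest    <- lie_factor(c(2, 5), baseline = 0)    # y-axis starting at 0
c(truncated = truncated, honest = honest)
```

With the baseline at 0 the drawn ratio and the actual ratio are the same, so the lie factor is 1.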

This isn't just a hypothetical problem either. For example, British electoral mailers are notorious for inaccurately portraying the magnitude of differences.

![](figures/inaccurate.png)


```{r echo = F}
scot <- tibble(party = c("Conservative",
                         "SNP",
                         "Lib Dem",
                         "Labour"),
               mps = c(1, 7, 12, 39))
ggplot(scot, aes(party, mps, fill = party))+
  geom_bar(stat = "identity", color = "black")+
  xlim(c("Conservative",
         "SNP",
         "Lib Dem",
         "Labour"))+
  scale_fill_manual(limits= c("Conservative",
                              "SNP",
                              "Lib Dem",
                              "Labour"),
                    values = c("#0087dc",
                               "#FFF95D",
                               "#FDBB30",
                               "#d50000"))+
    theme_minimal()
```

Both academic researchers and the producers of these political mailers may counter by saying





# `ggplot2` basic concepts

## Layers, Aesthetics, Geometries and Statistics
The first thing we're going to do is build up to creating this plot, which has one point for each speaker in the `buckeye` corpus, and plots their monomorphemic retention rate against their past tense retention rate.




```{r echo = F, message = F}
buckeye %>%
  group_by(Speaker, Gram2) %>%
  summarise(td = mean(td))%>%
  spread(Gram2, td)%>%
  ggplot(aes(mono, past))+
    geom_point(color = "#41817F",
               alpha = 0.8)+
    geom_abline(color = "#0D4D4B",
                linetype = 2)+
    stat_smooth(color = "#0D4D4B")+
    ylim(0,1)+
    xlim(0,1)+
    xlab("Monomorphemes")+
    ylab("Past Tense")+
    ggtitle("TD Retention Rates")+
    theme_minimal()+
    coord_fixed()
```

And then we'll build this plot:

```{r echo = F}
buckeye %>%
  group_by(Gram2) %>%
  summarise(td = mean(td))%>%
  filter(Gram2 %in% c("past", "semiweak", "mono")) %>%
  ggplot(aes(Gram2, td, fill = Gram2))+
    geom_bar(stat = "identity", color = "black")+
    geom_hline(yintercept = 0)+
    xlim("past", "semiweak", "mono")+
    ylim(0,1)+
    xlab("Grammatical Class")+
    ylab("Retention Rate")+
    scale_fill_few(name = "Grammatical Class", limits = c("past", "semiweak", "mono"))+
    theme_minimal()+
    ggtitle("TD Retention Rates")
      
```


### Layers

Hopefully, you'll start looking at figures like this one the way many of us look at the image below.

<div class = 'half-img'>
![](figures/Enlight9.jpg)
</div>

Those of us familiar with this kind of media know that the picture of the library is not what was originally captured by my phone. Rather, there are multiple layers of effects, filters and text on top of the base image, which produce the final image. And in fact, some of these layers are crucially ordered. For example, the text would look different if it was added to the image first, and then the filters, instead of vice versa.


So too with the `ggplot2` plot above. These plots are constructed out of __layers__. Every component of the graph, from the underlying data it's plotting, to the coordinate system it's plotted on, to the statistical summaries overlaid on top, to the axis labels, is a layer in the plot. The consequence of this is that your use of `ggplot2` will probably involve iterative addition of layer upon layer until you're pleased with the results.


### Aesthetics

The graphical properties which encode the data you're presenting are the __aesthetics__ of the plot. These include things like

- x position
- y position
- size of elements
- shape of elements
- color of elements


### Geometries

The primary visual items on the plots are called __geometries__ and include things like

* points
* lines
* line segments
* bars
* text

Some of these geometries have their own specific aesthetic settings. For example,

* points
    * point shape
* text
    * text labels
* lines
    * line weight
    * line type
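
Here is a minimal sketch of those geometry-specific settings in one plot. The data frame `toy` and the object name `geom_demo` are invented for this illustration, so don't read anything into the numbers.

```r
library(ggplot2)

# Hypothetical toy data, invented for this illustration
toy <- data.frame(
  x     = c(1, 2, 3, 4),
  y     = c(2, 4, 3, 5),
  label = c("a", "b", "c", "d")
)

geom_demo <- ggplot(toy, aes(x = x, y = y)) +
  geom_line(linetype = "dashed") +            # line type only makes sense for lines
  geom_point(shape = 17, size = 3) +          # shape 17 is a filled triangle
  geom_text(aes(label = label), vjust = -1)   # text needs a label aesthetic
geom_demo
```

Notice that `label` is mapped inside `aes()` because it comes from the data, while `shape` and `linetype` are set to fixed values outside of it.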
  
### Statistics

You'll also frequently want to plot __statistics__ overlaid on top of, or instead of, the raw data. Some of these include

* Smoothing and regression lines
* One and two dimensional binning
* Means and medians with confidence intervals.
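
As a rough sketch of a statistic drawn on top of raw data, the block below adds group means with error bars using `stat_summary()` and `mean_se`, both of which come with `ggplot2`. Note that `mean_se` gives the mean plus or minus one standard error, which is not quite a confidence interval. The `toy` data is made up for this illustration.

```r
library(ggplot2)

# Hypothetical toy data: three groups with five observations each
toy <- data.frame(
  group = rep(c("a", "b", "c"), each = 5),
  value = c(1, 2, 3, 4, 5,  2, 4, 6, 8, 10,  3, 3, 4, 5, 5)
)

stat_demo <- ggplot(toy, aes(x = group, y = value)) +
  geom_point(alpha = 0.4) +                         # the raw data
  stat_summary(fun.data = mean_se, color = "red")   # mean +/- 1 standard error
stat_demo

# The same group means, computed by hand for comparison
group_means <- as.vector(tapply(toy$value, toy$group, mean))
group_means
```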


----

The __aesthetics__, __geometries__ and __statistics__ constitute the most important __layers__ of a plot, but for fine-tuning a plot for publication, there are a number of other things you'll want to adjust. The most common of these are the __scales__, which encompass things like

* A logarithmic x or y axis
* Customized color scales
* Customized point shapes, or linetypes

We'll review many of these components as we build up the plot, and will circle back to more of them for greater detail.
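
To give a feel for what the scales look like in code, here is a small sketch with one scale of each kind listed above. The data frame `toy` and the name `scale_demo` are assumptions made up for the example; the colors are simply the ones used elsewhere in this notebook.

```r
library(ggplot2)

# Hypothetical toy data spanning several orders of magnitude on x
toy <- data.frame(
  x     = c(1, 10, 100, 1000),
  y     = c(1, 2, 3, 4),
  group = c("a", "b", "a", "b")
)

scale_demo <- ggplot(toy, aes(x = x, y = y, color = group, shape = group)) +
  geom_point(size = 3) +
  scale_x_log10() +                                               # logarithmic x-axis
  scale_color_manual(values = c(a = "#41817F", b = "#0D4D4B")) +  # custom colors
  scale_shape_manual(values = c(a = 16, b = 17))                  # custom point shapes
scale_demo
```

On the log scale, the four x values end up evenly spaced, because each one is ten times the last.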



# Building the Plot

First, let's refresh our memories of the graph we want to build.

```{r echo = F, message = F}
buckeye %>%
  group_by(Speaker, Gram2) %>%
  summarise(td = mean(td))%>%
  spread(Gram2, td)%>%
  ggplot(aes(mono, past))+
    geom_point(color = "#41817F",
               alpha = 0.8)+
    geom_abline(color = "#0D4D4B",
                linetype = 2)+
    stat_smooth(color = "#0D4D4B")+
    ylim(0,1)+
    xlim(0,1)+
    xlab("Monomorphemes")+
    ylab("Past Tense")+
    ggtitle("TD Retention Rates")+
    theme_minimal()+
    coord_fixed()
```

This plot is composed of ten layers, which can be subdivided into five layer types. It's not important for you to memorize these layer types, but it helps to structure the discussion.


## The data layer

Every `ggplot2` plot has a data layer, which defines the data set to plot, and the basic mappings of data to aesthetic elements. The data layer is created with the functions `ggplot()` and `aes()`, and looks like this

```{r eval=F}
ggplot(data, aes(...))
```

The first argument to `ggplot()` is a data frame (it _must_ be a data frame), and its second argument is `aes()`. You're never going to use `aes()` in any other context except for inside of other `ggplot2` functions, so it might be best not to think of `aes()` as its own function, but rather as a special way of defining data-to-aesthetic mappings.

So, obviously, we first need to *make* the data frame we want to plot with, which we can do with a quick split-apply-combine, then spread:

```{r}
(retention <- buckeye %>%
              group_by(Gram2, Speaker)%>%
              summarise(td = mean(td))%>%
              spread(Gram2, td))
```
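
If the split-apply-combine-then-spread step is hard to picture, here is the same shape of operation on a tiny made-up data set. `toy_tokens` and its values are invented; only the column names are borrowed from `buckeye`. I've used `pivot_wider()` here, which newer versions of `tidyr` (1.0 and later) offer as the successor to `spread()`, and the `.groups` argument assumes `dplyr` 1.0 or later.

```r
library(dplyr)
library(tidyr)

# Hypothetical token-level data: one row per word, td = 1 if /t,d/ was retained
toy_tokens <- data.frame(
  Speaker = c("s01", "s01", "s01", "s01", "s02", "s02", "s02", "s02"),
  Gram2   = c("mono", "mono", "past", "past", "mono", "mono", "past", "past"),
  td      = c(1, 0, 1, 1, 0, 0, 1, 0)
)

toy_retention <- toy_tokens %>%
  group_by(Speaker, Gram2) %>%                          # split
  summarise(td = mean(td), .groups = "drop") %>%        # apply + combine
  pivot_wider(names_from = Gram2, values_from = td)     # one column per class
toy_retention
```

The result has one row per speaker and one column per grammatical class, which is exactly the shape we need for plotting one class against another.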


With this data, we're going to map the `mono` retention rates to the x-axis, and the `past` retention rates to the y-axis, which we can do like so:

```{r}
p <- ggplot(retention, aes(x = mono, y = past))
p
```



You can think of this plot as the base image, before we've added any extra layers, text or Instagram filters to it. An important conceptual point is that you can assign plots to variables (in this case, `p`). When you do this assignment, nothing is drawn. But if you print `p`, R will generate the plot.


## The geometries layer

The next step, after defining the basic data-to-aesthetic mappings, is to add geometries to the data. We'll discuss geometries in more detail below, but for now, we'll add one of the simplest: points.

```{r fig.pos = "center",fig.width = 8/1.5, fig.height=5/1.5}
  p <- p + geom_point()
  p
```

There are a few things to take away from this step. First and foremost, the way you add new layers, of any kind, to a plot is with the `+` operator. And, as we'll see in a moment, there's no need to only add them one at a time. You can string together any number of layers to add to a plot, separated by `+`.

The next thing to notice is that all layers you add to a plot are, technically, functions. We didn't pass any arguments to `geom_point()`, so the resulting plot represents the default behavior: solid black circular points.
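
To make the `+` point concrete, here is a small sketch showing that adding layers one at a time and adding them all at once get you to the same place. The data frame `toy` and the object names are made up for the example.

```r
library(ggplot2)

# Hypothetical toy data, invented for this illustration
toy <- data.frame(x = c(1, 2, 3, 4), y = c(2, 3, 5, 4))

base <- ggplot(toy, aes(x = x, y = y))   # data layer only, no geometries yet

# Adding layers one at a time...
one_at_a_time <- base + geom_point()
one_at_a_time <- one_at_a_time + geom_line()

# ...or stringing them together with `+`
all_at_once <- base + geom_point() + geom_line()

# Each plot object keeps a list of the layers added so far
length(one_at_a_time$layers)
length(all_at_once$layers)
```

Because `base` is never overwritten, you can keep reusing it as a starting point for different combinations of layers.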

If for no good reason at all we wanted to use a different point shape in the plot, we could specify it inside of `geom_point()`.

```{r fig.pos = "center",fig.width = 8/1.5, fig.height=5/1.5, tidy = F}
ggplot(retention, aes(x = mono, y = past)) +
  geom_point(shape = 3)
```

Or, if we wanted to use larger, red points, we could specify that in `geom_point()` as well.
```{r fig.pos = "center",fig.width = 8/1.5, fig.height=5/1.5, tidy = F}
ggplot(retention, aes(x = mono, y = past))+
  geom_point(color = "red", size = 3)
```

Speaking of defaults, we can see a few of the default settings of `ggplot2` on display here. Most striking is the light grey background, with white grid lines. Opinion varies on whether or not this is aesthetically or technically pleasing, but don't worry, it's adjustable.

Another default is to label the x and y axes with the column names from the data frame. I'll inject a bit of best practice advice here, and tell you to _always_ change the axis names. It's nearly guaranteed that your data frame column names will make for very poor axis labels. We'll cover how to do that shortly.

Finally, note that we didn't need to tell `geom_point()` about the x and y axes. This may seem trivial, but it's a really important and powerful aspect of `ggplot2`. When you add any layer at all to a plot, it will __inherit__ the data-to-aesthetic mappings which were defined in the data layer. We'll discuss inheritance, and how to override it or define new data-to-aesthetic mappings within any geom, later on.
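
As a quick preview of inheritance, here is a sketch with three layers that treat the plot-level mappings differently. The data frames `toy` and `note`, and every column in them, are hypothetical.

```r
library(ggplot2)

# Hypothetical toy data, invented for this illustration
toy <- data.frame(
  x     = c(1, 2, 3, 4),
  y     = c(2, 4, 3, 5),
  group = c("a", "a", "b", "b")
)
# A second, unrelated data frame holding a single annotation
note <- data.frame(px = 2.5, py = 5, txt = "separate data")

inherit_demo <- ggplot(toy, aes(x = x, y = y)) +
  geom_line() +                                 # inherits x and y as they are
  geom_point(aes(color = group), size = 3) +    # inherits x and y, adds color for this layer only
  geom_text(data = note,
            aes(x = px, y = py, label = txt),
            inherit.aes = FALSE)                # ignores the plot-level mappings entirely
inherit_demo
```

The line stays a single black line because the `color` mapping was only given to `geom_point()`, not to the whole plot.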



---

Let's come back to our current plot:

```{r}
p
```

I want to add an additional geom that is just a diagonal line with an intercept of 0 and a slope of 1. That is, a line that indicates where `mono` == `past`.

```{r}
p <- p + geom_abline(intercept = 0, slope = 1)
p
```

## The Statistics Layer

The final figure also includes a smoothing line, which is one of many possible statistical layers we can add to a plot.
```{r fig.width = 8/1.5, fig.height=5/1.5}
  p <- p + stat_smooth()
  p
```
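
With no arguments, `stat_smooth()` picks a smoothing method for you and tells you which one in a message. If you would rather be explicit, you can name the method yourself. Here is a minimal sketch that asks for a straight regression line with no confidence band; `toy` is made-up data.

```r
library(ggplot2)

# Hypothetical toy data with a roughly linear trend
toy <- data.frame(
  x = 1:10,
  y = c(1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.1, 9.2, 9.7)
)

smooth_demo <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  stat_smooth(method = "lm",     # linear model instead of the automatic choice
              formula = y ~ x,   # spelled out, so no message is printed
              se = FALSE)        # leave off the confidence band
smooth_demo
```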

## Cosmetic alterations
Finally, I wanted to make some cosmetic adjustments to the plot. For example, the x-axis label "mono" is not quite as useful as "Monomorphemes" would be. I also adjusted the y and x limits, added a title, changed the default "theme" from the grey background, and made it so that the plot has a square aspect ratio.

```{r tidy = F,fig.width = 8/1.5, fig.height=5/1.5, warning = F, message  =F }
p <- p + 
    ylim(0,1)+
    xlim(0,1)+
    xlab("Monomorphemes")+
    ylab("Past Tense")+
    ggtitle("TD Retention Rates")+
    theme_minimal()+
    coord_fixed()
p
```
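
One thing worth knowing about `xlim()` and `ylim()` is that they set the limits of the scale, so any observations outside of them are dropped from the plot (with a warning). If you only want to zoom in, you can give the limits to the coordinate system instead. The sketch below does that with `coord_fixed()`, and sets all the labels in one go with `labs()`. The `toy` data and the label text are placeholders.

```r
library(ggplot2)

# Hypothetical toy data; the last point falls outside the 0-1 range
toy <- data.frame(
  x = c(0.2, 0.4, 0.6, 0.8, 1.2),
  y = c(0.1, 0.5, 0.4, 0.9, 1.1)
)

zoom_demo <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  labs(x = "X label", y = "Y label", title = "Title") +   # all labels in one call
  theme_minimal() +
  coord_fixed(xlim = c(0, 1), ylim = c(0, 1))             # zoom without dropping data
zoom_demo

# How many points lie outside the zoomed region?
n_outside <- sum(toy$x > 1 | toy$y > 1)
n_outside
```

For the retention rates this makes little practical difference, since proportions can't fall outside 0 and 1, but it matters as soon as your data can exceed the limits you set.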




And that's basically the whole show, except I changed the color of the points and lines to some that I like. Here's the full code to make that whole plot:


```{r  message = F}
buckeye %>%
  group_by(Speaker, Gram2) %>%
  summarise(td = mean(td))%>%
  spread(Gram2, td)%>%
  ggplot(aes(mono, past))+
    geom_point(color = "#41817F",
               alpha = 0.8)+
    geom_abline(color = "#0D4D4B",
                linetype = 2)+
    stat_smooth(color = "#0D4D4B")+
    ylim(0,1)+
    xlim(0,1)+
    xlab("Monomorphemes")+
    ylab("Past Tense")+
    ggtitle("TD Retention Rates")+
    theme_minimal()+
    coord_fixed()
```


# Making the bar plot

Again, we need to start out making the data set we want to work with:

```{r}
buckeye %>%
  group_by(Gram2) %>%
  summarise(td = mean(td))
```


Then, we can start adding it to a plot. First, just the data layer.

```{r}
buckeye %>%
  group_by(Gram2) %>%
  summarise(td = mean(td)) %>%
  filter(Gram2 %in% c("past", "semiweak", "mono"))%>%
  ggplot(aes(Gram2, td))
```


Then, we can add `geom_col()` ("col" for "column") to the graph.

```{r}
buckeye %>%
  group_by(Gram2) %>%
  summarise(td = mean(td)) %>%
  filter(Gram2 %in% c("past", "semiweak", "mono"))%>%
  ggplot(aes(Gram2, td))+
    geom_col()
```


Things aren't quite in the order we want them, and we can fix that with `xlim()`.

```{r}
buckeye %>%
  group_by(Gram2) %>%
  summarise(td = mean(td)) %>%
  filter(Gram2 %in% c("past", "semiweak", "mono"))%>%
  ggplot(aes(Gram2, td))+
    geom_col()+
    xlim("past", "semiweak", "mono")
```
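
Another way to control the order, sketched here under the assumption that you're happy to change the data rather than the plot, is to turn the column into a factor with its levels in the order you want. The values in `toy_means` are invented; only the class names match the ones above.

```r
library(ggplot2)

# Hypothetical summary data with made-up retention rates
toy_means <- data.frame(
  Gram2 = c("mono", "past", "semiweak"),
  td    = c(0.6, 0.8, 0.7)
)

# Discrete axes follow the order of the factor levels
toy_means$Gram2 <- factor(toy_means$Gram2,
                          levels = c("past", "semiweak", "mono"))

order_demo <- ggplot(toy_means, aes(x = Gram2, y = td)) +
  geom_col()
order_demo

levels(toy_means$Gram2)
```

The advantage of this approach is that the order travels with the data, so every plot you make from it comes out the same way.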


Now, we can add some colors to the columns:

```{r}
buckeye %>%
  group_by(Gram2) %>%
  summarise(td = mean(td)) %>%
  filter(Gram2 %in% c("past", "semiweak", "mono"))%>%
  ggplot(aes(Gram2, td))+
    geom_col(aes(fill=Gram2), color = "black")+
    xlim("past", "semiweak", "mono")
```


These default colors aren't *great*. So I'll use a different color palette from the `ggthemes` package ([more info](https://cran.r-project.org/web/packages/ggthemes/vignettes/ggthemes.html)) for now:


```{r}
buckeye %>%
  group_by(Gram2) %>%
  summarise(td = mean(td)) %>%
  filter(Gram2 %in% c("past", "semiweak", "mono"))%>%
  ggplot(aes(Gram2, td))+
    geom_col(aes(fill=Gram2), color = "black")+
    xlim("past", "semiweak", "mono")+
    scale_fill_few()+
    theme_minimal()
```


The automatically created legend needs some work though:


```{r}
buckeye %>%
  group_by(Gram2) %>%
  summarise(td = mean(td)) %>%
  filter(Gram2 %in% c("past", "semiweak", "mono"))%>%
  ggplot(aes(Gram2, td))+
    geom_col(aes(fill=Gram2), color = "black")+
    xlim("past", "semiweak", "mono")+
    scale_fill_few(name = "Grammatical Class",
                   limits = c("past", "semiweak", "mono"))+
    theme_minimal()
```


We should also adjust the y-axis to run up to 1, and add a horizontal line to emphasize the bottom of the graph:


```{r}
buckeye %>%
  group_by(Gram2) %>%
  summarise(td = mean(td)) %>%
  filter(Gram2 %in% c("past", "semiweak", "mono"))%>%
  ggplot(aes(Gram2, td))+
    geom_col(aes(fill=Gram2), color = "black")+
    xlim("past", "semiweak", "mono")+
    scale_fill_few(name = "Grammatical Class",
                   limits = c("past", "semiweak", "mono"))+
    theme_minimal()+
    ylim(0,1)+
    geom_hline(yintercept = 0)
```

